Introduction

Principal component analysis (PCA) is a dimensionality-reduction method for multi-dimensional datasets. It transforms the original variables into a smaller set of components while retaining most of the information in the data. Although some accuracy is lost, PCA is well suited to simplifying large, complicated datasets, exploring overall patterns, and preparing data for visualization.

Why is PCA useful?

The most important use of PCA is to represent a multivariate data table as a smaller set of variables (summary indices) in order to observe trends, jumps, clusters, and outliers. PCA is a very flexible tool and allows analysis of datasets that may contain, for example, multicollinearity, missing values, categorical data, and imprecise measurements. The goal of PCA is to extract the important information from the data and to express it as a set of summary indices called principal components. (https://www.sartorius.com/en/knowledge/science-snippets/what-is-principal-component-analysis-pca-and-how-it-is-used-507186)

When to use PCA

Qualities of datasets that make them good candidates for PCA

  • PCA works with datasets of NUMERIC variables, because distance calculations and standardization/normalization are hard to define for mixed or categorical variables. Other methods are better suited to doing what PCA does on mixed or categorical datasets.
  • PCA is most useful on datasets with 3 or more dimensions, because as the number of dimensions grows it becomes increasingly difficult to interpret the resulting cloud of data directly. (https://www.analyticsvidhya.com/blog/2016/03/pca-practical-guide-principal-component-analysis-python/)
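
A quick way to check the first condition is to see which columns of a data frame are numeric before running PCA. The sketch below uses R's built-in iris data as a stand-in (an assumption for illustration; it is not one of the datasets analysed here).

```r
# Keep only the numeric columns of a data frame before PCA.
# `iris` ships with R; it has four numeric columns and one factor.
is_num <- vapply(iris, is.numeric, logical(1))
numeric_only <- iris[, is_num, drop = FALSE]

names(iris)[!is_num]   # columns that would have to be dropped or recoded
ncol(numeric_only)     # number of dimensions available to PCA
```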

Generally PCA in R

The process of PCA can be broken down into 5 steps.

Step 0: load and tidy the datasets

Let’s start by loading the packages and datasets needed for this analysis. One package you might not be familiar with is caret, which provides the findCorrelation() function. This function identifies highly correlated predictors so that they can be removed.

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(dplyr)
library(caret)
library(corrplot)
library(readr)

f <- "https://raw.githubusercontent.com/mrpickett26/Final_Group_Project/main/WBCDSdata.csv"
wdbc <- read_csv(f, col_names = TRUE) ## load in the Wisconsin breast cancer dataset
wbg <- read.csv2("https://userpage.fu-berlin.de/soga/300/30100_data_sets/DWD.csv", stringsAsFactors = FALSE) ## load in the weather dataset (read.csv2 expects semicolon-separated fields)

Now we should put the data into a tidier format so that it is easier to work with. In the weather dataset we can exclude some features before we begin the analysis to keep it more straightforward. We are looking at how the variables interact with each other regardless of ID number, so we can drop any ID variables.

wbg<- wbg%>%select(LAT, LON, ALTITUDE, RECORD.LENGTH, MEAN.ANNUAL.AIR.TEMP, MEAN.MONTHLY.MAX.TEMP, MEAN.MONTHLY.MIN.TEMP, MEAN.ANNUAL.RAINFALL, MEAN.ANNUAL.SUNSHINE, MEAN.ANNUAL.WIND.SPEED,MEAN.RANGE.AIR.TEMP, MEAN.CLOUD.COVER, MAX.RAINFALL) 

wbg<-na.omit(wbg)

The WBC dataset classifies each tumor as benign or malignant. Before running PCA it is useful to separate the numeric features from this response variable, so we store the 30 feature columns as a matrix and keep the diagnosis in its own vector.

wdbc.data <- as.matrix(wdbc[,c(3:32)]) #The 30 numeric feature columns
row.names(wdbc.data) <- wdbc$id
diagnosis <- as.numeric(wdbc$diagnosis == "Malignant") #Numeric diagnosis vector: 1 = malignant, 0 = otherwise
outcome<- wdbc$diagnosis #Original diagnosis labels

round(colMeans(wdbc.data),2) #It is also helpful to know the mean of the columns. This finds the means of each variable in the matrix. 
##             radius_mean            texture_mean          perimeter_mean 
##                   14.14                   19.28                   92.05 
##               area_mean         smoothness_mean        compactness_mean 
##                  655.72                    0.10                    0.10 
##          concavity_mean     concave points_mean           symmetry_mean 
##                    0.09                    0.05                    0.18 
##  fractal_dimension_mean               radius_se              texture_se 
##                    0.06                    0.41                    1.22 
##            perimeter_se                 area_se           smoothness_se 
##                    2.87                   40.37                    0.01 
##          compactness_se            concavity_se       concave points_se 
##                    0.03                    0.03                    0.01 
##             symmetry_se    fractal_dimension_se            radius_worst 
##                    0.02                    0.00                   16.28 
##           texture_worst         perimeter_worst              area_worst 
##                   25.67                  107.35                  881.66 
##        smoothness_worst       compactness_worst         concavity_worst 
##                    0.13                    0.25                    0.27 
##    concave points_worst          symmetry_worst fractal_dimension_worst 
##                    0.11                    0.29                    0.08
SD_var <- function(x){
    round(sd(x), 2)
}
apply(wdbc.data, 2, SD_var)
##             radius_mean            texture_mean          perimeter_mean 
##                    3.52                    4.30                   24.25 
##               area_mean         smoothness_mean        compactness_mean 
##                  351.66                    0.01                    0.05 
##          concavity_mean     concave points_mean           symmetry_mean 
##                    0.08                    0.04                    0.03 
##  fractal_dimension_mean               radius_se              texture_se 
##                    0.01                    0.28                    0.55 
##            perimeter_se                 area_se           smoothness_se 
##                    2.02                   45.52                    0.00 
##          compactness_se            concavity_se       concave points_se 
##                    0.02                    0.03                    0.01 
##             symmetry_se    fractal_dimension_se            radius_worst 
##                    0.01                    0.00                    4.83 
##           texture_worst         perimeter_worst              area_worst 
##                    6.15                   33.57                  569.28 
##        smoothness_worst       compactness_worst         concavity_worst 
##                    0.02                    0.16                    0.21 
##    concave points_worst          symmetry_worst fractal_dimension_worst 
##                    0.07                    0.06                    0.02
corMatrix <- wdbc[,c(3:32)]

#This function creates a correlation matrix. A correlation matrix is a table that demonstrates which variables have a linear relationship between them. 
M <- round(cor(corMatrix), 2)

#Allows for visualization of the correlation matrix in a correlation plot. 
corrplot(M, diag = FALSE, method="color", order="FPC", tl.srt = 90)

#This plot shows us that there are many variables that are correlated to one another 
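
To complement the plot, the strongly correlated pairs can also be listed by name. The sketch below is written in base R rather than with caret's findCorrelation(), and it runs on the built-in mtcars data; the 0.85 cutoff and the object names are assumptions chosen for illustration.

```r
# List variable pairs whose absolute correlation exceeds a cutoff.
# `mtcars` ships with R; the 0.85 cutoff is an arbitrary illustration.
cm <- cor(mtcars)
cutoff <- 0.85

# upper.tri() keeps each pair once and skips the diagonal
idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = round(cm[idx], 2)
)
high_pairs
```

The same idea applies to the matrix M computed above: any pair that appears in such a list carries largely redundant information, which is the situation PCA is designed to compress.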

Step 1: Standardizing all variables

Whenever variables are measured in different units or on different scales (e.g., lengths, areas, percentages), it is crucial to standardize them before conducting any further analysis, so that all variances are measured on the same scale. The goal is to make the variables comparable. Generally, each variable is scaled to have i) mean zero and ii) standard deviation one.

Statistically speaking, this means converting every variable to z-scores.

When working in R we can use the scale() function to standardize our variables. As a refresher on how standardization works, we have also included a hand-written scale function below.

#Z-score a numeric vector: subtract the mean, then divide by the standard deviation
scale_func<-function(x){
   (x-mean(x))/sd(x)
}

#Now we can try it out using both the built-in function and our function

#Using our scale function
wbg_funct_scale<-as.data.frame(lapply(wbg[,sapply(wbg, is.numeric)], scale_func))
wdbc_funct_scale<-apply(wdbc.data, 2, scale_func) #wdbc.data is a matrix, so apply the function column by column


#Using the built in R function
wbg_scale_built_in<-scale(wbg, center = TRUE, scale = TRUE)
wdbc_scale_built_in<-scale(wdbc.data, center = TRUE, scale = TRUE)
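
As a sanity check, a hand-written z-score should agree with scale() and leave every column with mean zero and standard deviation one. The sketch below checks this on the built-in mtcars data; the name z_score is a hypothetical stand-in for the function defined above.

```r
# Compare a hand-written z-score with base R's scale() on a built-in dataset.
z_score <- function(x) (x - mean(x)) / sd(x)

manual   <- apply(as.matrix(mtcars), 2, z_score)
built_in <- scale(mtcars, center = TRUE, scale = TRUE)

all.equal(manual, built_in, check.attributes = FALSE)  # TRUE when they agree
round(colMeans(manual), 10)   # every column mean is 0 (up to rounding)
apply(manual, 2, sd)          # every column standard deviation is 1
```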

prcomp() takes care of standardization when called with scale. = TRUE, and princomp() does so when called with cor = TRUE. The two functions represent two different methods of doing PCA:

  1. Spectral decomposition (princomp()), which takes the eigenvalues and eigenvectors of the covariance or correlation matrix of the variables, and
  2. Singular value decomposition (prcomp()), which decomposes the centered data matrix directly and is generally the preferred method for numerical accuracy.

In this overview we will run PCA on two datasets. We will use singular value decomposition (prcomp()) for the weather dataset and spectral decomposition (princomp()) for the breast cancer dataset.

  1. The dataset we will use for singular value decomposition is weather-station data (DWD.csv) hosted by Freie Universität Berlin. Each row is a weather station and each column is a variable recorded for that station, such as latitude, altitude, or mean annual rainfall. This analysis is adapted from (https://www.geo.fu-berlin.de/en/v/soga/Geodata-analysis/Principal-Component-Analysis/PCA-an-example/Data-preparation/index.html).

  2. The dataset we will use for spectral decomposition is the Wisconsin Breast Cancer dataset. It provides features of fine needle aspirates of breast cancer samples from patients from the University of Wisconsin Medical Center and has been commonly used in machine learning. The code used for this PCA analysis was adapted from (https://www.kaggle.com/code/shravank/predicting-breast-cancer-using-pca-lda-in-r/report).
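
The two decompositions lead to the same components, which can be checked numerically. The sketch below uses the built-in USArrests data (an assumption for illustration, not one of the datasets above) and shows that the only difference in the reported variances is the divisor: prcomp() divides by n - 1 and princomp() divides by n.

```r
# Two routes to the same principal components, using built-in data.
X  <- as.matrix(USArrests)
Xc <- scale(X, center = TRUE, scale = FALSE)   # centre each column
n  <- nrow(Xc)

# Route 1: spectral (eigen) decomposition of the covariance matrix
var_eigen <- eigen(cov(X))$values

# Route 2: singular value decomposition of the centred data matrix
var_svd <- svd(Xc)$d^2 / (n - 1)

all.equal(var_eigen, var_svd)                 # same component variances
all.equal(prcomp(X)$sdev^2, var_svd)          # prcomp() uses divisor n - 1
all.equal(unname(princomp(X)$sdev^2),
          var_eigen * (n - 1) / n)            # princomp() uses divisor n
```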

Step 2: Creating a covariance matrix

Next, we need to understand how each variable varies around its mean and whether the variables are associated with one another. To do this, we create a covariance matrix: a p x p symmetric matrix (where p is the number of dimensions) whose entries are the covariances between every pair of the initial variables, with each variable's variance on the diagonal.
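
To make the definition concrete, the covariance matrix can be built by hand from centered data and compared with R's result. This is a minimal sketch on three columns of the built-in mtcars data; the object names are assumptions for illustration.

```r
# Build a covariance matrix by hand and compare it with cov().
X  <- as.matrix(mtcars[, c("mpg", "hp", "wt")])   # built-in data, p = 3
n  <- nrow(X)
Xc <- sweep(X, 2, colMeans(X))                    # subtract each column mean

S <- crossprod(Xc) / (n - 1)                      # t(Xc) %*% Xc / (n - 1)

all.equal(S, cov(X))                   # TRUE: same p x p matrix
isSymmetric(S)                         # TRUE: cov(x, y) equals cov(y, x)
all.equal(diag(S), apply(X, 2, var))   # TRUE: diagonal holds the variances
```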

Statistically it looks something like this.
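
A sketch of the standard form, for p variables measured on n observations (the symbols below are notation assumed here, with x_{ij} the value of variable j for observation i):

```latex
\[
\Sigma =
\begin{pmatrix}
\operatorname{Var}(x_1) & \operatorname{Cov}(x_1, x_2) & \cdots & \operatorname{Cov}(x_1, x_p) \\
\operatorname{Cov}(x_2, x_1) & \operatorname{Var}(x_2) & \cdots & \operatorname{Cov}(x_2, x_p) \\
\vdots & \vdots & \ddots & \vdots \\
\operatorname{Cov}(x_p, x_1) & \operatorname{Cov}(x_p, x_2) & \cdots & \operatorname{Var}(x_p)
\end{pmatrix},
\qquad
\operatorname{Cov}(x_j, x_k) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)
\]
```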

The PCA functions in R work from this covariance structure, so we can run them directly:

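  1. Singular value decomposition, using prcomp()
wbg.pcov <- prcomp(wbg, center = TRUE, scale. = FALSE) #Unscaled PCA; prcomp() stops on missing or infinite values, hence na.omit() above
summary(wbg.pcov) #Importance of each component (standard deviation and share of variance)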
## Importance of components:
##                            PC1      PC2      PC3      PC4     PC5     PC6
## Standard deviation     336.239 205.0131 133.5572 38.98558 3.76351 2.87229
## Proportion of Variance   0.648   0.2409   0.1022  0.00871 0.00008 0.00005
## Cumulative Proportion    0.648   0.8889   0.9911  0.99983 0.99991 0.99995
##                            PC7     PC8    PC9   PC10   PC11   PC12    PC13
## Standard deviation     2.15926 1.46936 0.8693 0.4508 0.3815 0.1251 0.05269
## Proportion of Variance 0.00003 0.00001 0.0000 0.0000 0.0000 0.0000 0.00000
## Cumulative Proportion  0.99998 0.99999 1.0000 1.0000 1.0000 1.0000 1.00000
cex.before <- par("cex")
par(cex = 0.7)
biplot(wbg.pcov) #Biplot of the observations and variable loadings on the first two components
par(cex = cex.before) #Restore the previous text size
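
The rows of the summary table follow directly from the component standard deviations: each proportion is a squared standard deviation divided by the sum of all of them. For PC1 above, 336.239² divided by the sum of the thirteen squared standard deviations gives about 0.648. The sketch below reproduces the calculation on the built-in USArrests data (an assumption for illustration).

```r
# Reproduce the "Importance of components" rows from a prcomp() fit.
pca <- prcomp(USArrests)                  # built-in data, unscaled

variances <- pca$sdev^2                   # variance carried by each component
prop_var  <- variances / sum(variances)   # "Proportion of Variance"
cum_var   <- cumsum(prop_var)             # "Cumulative Proportion"

round(rbind(prop_var, cum_var), 4)

# Agrees with what summary() reports, up to its rounding
all.equal(unname(summary(pca)$importance["Proportion of Variance", ]),
          prop_var, tolerance = 1e-4)
```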

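  2. Spectral decomposition, using princomp()
wdbc.pcov <- princomp(wdbc.data, cor = FALSE, scores = TRUE) #Unscaled PCA on the covariance matrix
summary(wdbc.pcov) #Importance of each component (standard deviation and share of variance)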
## Importance of components:
##                             Comp.1     Comp.2       Comp.3       Comp.4
## Standard deviation     665.3817048 85.4178639 26.510816702 7.3917203656
## Proportion of Variance   0.9820363  0.0161839  0.001558949 0.0001211928
## Cumulative Proportion    0.9820363  0.9982202  0.999779136 0.9999003291
##                              Comp.5       Comp.6       Comp.7       Comp.8
## Standard deviation     6.285379e+00 1.723376e+00 1.341902e+00 6.094794e-01
## Proportion of Variance 8.762919e-05 6.587882e-06 3.994177e-06 8.239556e-07
## Cumulative Proportion  9.999880e-01 9.999945e-01 9.999985e-01 9.999994e-01
##                              Comp.9      Comp.10      Comp.11      Comp.12
## Standard deviation     3.939371e-01 2.896455e-01 1.776972e-01 8.644681e-02
## Proportion of Variance 3.442227e-07 1.860884e-07 7.004017e-08 1.657615e-08
## Cumulative Proportion  9.999997e-01 9.999999e-01 1.000000e+00 1.000000e+00
##                             Comp.13      Comp.14      Comp.15      Comp.16
## Standard deviation     5.622228e-02 4.648825e-02 3.642125e-02 2.526125e-02
## Proportion of Variance 7.011366e-09 4.793715e-09 2.942358e-09 1.415453e-09
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.17      Comp.18      Comp.19      Comp.20
## Standard deviation     1.935774e-02 1.528887e-02 1.357492e-02 1.271995e-02
## Proportion of Variance 8.311803e-10 5.184857e-10 4.087525e-10 3.588860e-10
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.21      Comp.22      Comp.23      Comp.24
## Standard deviation     8.801581e-03 7.579317e-03 5.909075e-03 5.305210e-03
## Proportion of Variance 1.718332e-10 1.274224e-10 7.745061e-11 6.242966e-11
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.25      Comp.26      Comp.27      Comp.28
## Standard deviation     3.978787e-03 3.530384e-03 1.917828e-03 1.675896e-03
## Proportion of Variance 3.511456e-11 2.764583e-11 8.158399e-12 6.229881e-12
## Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
##                             Comp.29      Comp.30
## Standard deviation     1.415907e-03 8.352584e-04
## Proportion of Variance 4.446880e-12 1.547489e-12
## Cumulative Proportion  1.000000e+00 1.000000e+00
cex.before <- par("cex")
par(cex = 0.7)
biplot(wdbc.pcov) #Biplot of the observations and variable loadings on the first two components
par(cex = cex.before) #Restore the previous text size
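
Both summaries above come from unscaled data, which helps explain why the first breast cancer component carries about 98% of the variance: area_mean and area_worst have standard deviations in the hundreds (351.66 and 569.28), while most other features are below 1. The sketch below illustrates the effect of standardizing on the built-in USArrests data (an assumption for illustration); the helper name prop is hypothetical.

```r
# Effect of standardising before PCA, shown on built-in data.
raw    <- prcomp(USArrests, scale. = FALSE)   # covariance-based PCA
scaled <- prcomp(USArrests, scale. = TRUE)    # correlation-based PCA

# Share of total variance carried by each component
prop <- function(fit) round(fit$sdev^2 / sum(fit$sdev^2), 3)

rbind(raw = prop(raw), scaled = prop(scaled))

# In the raw fit the first component is dominated by Assault, the variable
# with by far the largest variance; scaling spreads the variance more evenly.
apply(USArrests, 2, var)
```

Passing scale. = TRUE to prcomp(), or cor = TRUE to princomp(), applies the same standardization to the datasets used here.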

When working in R, we can also use the cov() function to create the covariance matrix directly.

cov(wbg)
##                                 LAT          LON     ALTITUDE RECORD.LENGTH
## LAT                       3.9083311   0.31471758  -363.947574  1.114915e+00
## LON                       0.3147176   4.50335838    65.730720 -4.628255e-01
## ALTITUDE               -363.9475739  65.73071985 74843.841844 -6.789435e+02
## RECORD.LENGTH             1.1149147  -0.46282546  -678.943535  1.552087e+03
## MEAN.ANNUAL.AIR.TEMP      0.4051971  -0.84511327  -270.394549  1.246641e+00
## MEAN.MONTHLY.MAX.TEMP    -0.2918183  -0.63208114  -253.862668  1.823202e+00
## MEAN.MONTHLY.MIN.TEMP     0.8519915  -0.94548367  -253.174712  2.080328e+00
## MEAN.ANNUAL.RAINFALL   -203.0942769 -67.18623567 46507.369717 -1.163307e+03
## MEAN.ANNUAL.SUNSHINE    -31.8965038  53.50339561  3384.126415  4.527111e+02
## MEAN.ANNUAL.WIND.SPEED    0.6192575   0.08356332     1.499841  9.750910e-01
## MEAN.RANGE.AIR.TEMP      -1.1398222   0.31438218    -1.544904 -8.724906e-02
## MEAN.CLOUD.COVER          0.7452552  -0.52549989   108.449374  2.027231e+00
## MAX.RAINFALL             -8.5491138   2.50954407  1588.926450 -3.224174e+01
##                        MEAN.ANNUAL.AIR.TEMP MEAN.MONTHLY.MAX.TEMP
## LAT                               0.4051971            -0.2918183
## LON                              -0.8451133            -0.6320811
## ALTITUDE                       -270.3945494          -253.8626682
## RECORD.LENGTH                     1.2466408             1.8232018
## MEAN.ANNUAL.AIR.TEMP              1.5566522             1.6484184
## MEAN.MONTHLY.MAX.TEMP             1.6484184             2.1207512
## MEAN.MONTHLY.MIN.TEMP             1.3473533             1.1495423
## MEAN.ANNUAL.RAINFALL           -173.2623966          -174.1976859
## MEAN.ANNUAL.SUNSHINE             -5.4415859            -4.4152202
## MEAN.ANNUAL.WIND.SPEED           -0.2054398            -0.5114183
## MEAN.RANGE.AIR.TEMP               0.3040843             0.9678963
## MEAN.CLOUD.COVER                 -1.0099019            -1.4644658
## MAX.RAINFALL                     -5.4527988            -4.7877357
##                        MEAN.MONTHLY.MIN.TEMP MEAN.ANNUAL.RAINFALL
## LAT                               0.85199154         -203.0942769
## LON                              -0.94548367          -67.1862357
## ALTITUDE                       -253.17471249        46507.3697173
## RECORD.LENGTH                     2.08032847        -1163.3068722
## MEAN.ANNUAL.AIR.TEMP              1.34735332         -173.2623966
## MEAN.MONTHLY.MAX.TEMP             1.14954234         -174.1976859
## MEAN.MONTHLY.MIN.TEMP             1.46154721         -146.2282752
## MEAN.ANNUAL.RAINFALL           -146.22827520        56043.4069282
## MEAN.ANNUAL.SUNSHINE              6.21647139          132.2893354
## MEAN.ANNUAL.WIND.SPEED            0.06225097            0.4430323
## MEAN.RANGE.AIR.TEMP              -0.30549443          -28.5086495
## MEAN.CLOUD.COVER                 -0.66993168           84.7735733
## MAX.RAINFALL                     -5.09519502         1574.6057616
##                        MEAN.ANNUAL.SUNSHINE MEAN.ANNUAL.WIND.SPEED
## LAT                              -31.896504             0.61925751
## LON                               53.503396             0.08356332
## ALTITUDE                        3384.126415             1.49984098
## RECORD.LENGTH                    452.711084             0.97509096
## MEAN.ANNUAL.AIR.TEMP              -5.441586            -0.20543979
## MEAN.MONTHLY.MAX.TEMP             -4.415220            -0.51141834
## MEAN.MONTHLY.MIN.TEMP              6.216471             0.06225097
## MEAN.ANNUAL.RAINFALL             132.289335             0.44303234
## MEAN.ANNUAL.SUNSHINE           41952.486935            13.96920719
## MEAN.ANNUAL.WIND.SPEED            13.969207             0.50629723
## MEAN.RANGE.AIR.TEMP              -10.566639            -0.57305104
## MEAN.CLOUD.COVER                -129.914110             0.43609903
## MAX.RAINFALL                     120.852556            -0.34851665
##                        MEAN.RANGE.AIR.TEMP MEAN.CLOUD.COVER MAX.RAINFALL
## LAT                            -1.13982216        0.7452552   -8.5491138
## LON                             0.31438218       -0.5254999    2.5095441
## ALTITUDE                       -1.54490433      108.4493741 1588.9264496
## RECORD.LENGTH                  -0.08724906        2.0272307  -32.2417373
## MEAN.ANNUAL.AIR.TEMP            0.30408429       -1.0099019   -5.4527988
## MEAN.MONTHLY.MAX.TEMP           0.96789634       -1.4644658   -4.7877357
## MEAN.MONTHLY.MIN.TEMP          -0.30549443       -0.6699317   -5.0951950
## MEAN.ANNUAL.RAINFALL          -28.50864947       84.7735733 1574.6057616
## MEAN.ANNUAL.SUNSHINE          -10.56663868     -129.9141096  120.8525558
## MEAN.ANNUAL.WIND.SPEED         -0.57305104        0.4360990   -0.3485167
## MEAN.RANGE.AIR.TEMP             1.27212528       -0.7816299    0.3069282
## MEAN.CLOUD.COVER               -0.78162990        8.8445666    0.6550263
## MAX.RAINFALL                    0.30692822        0.6550263   58.5403404
cov(wdbc.data)
##                           radius_mean  texture_mean perimeter_mean
## radius_mean              1.236919e+01  4.975301e+00   8.510231e+01
## texture_mean             4.975301e+00  1.848283e+01   3.490911e+01
## perimeter_mean           8.510231e+01  3.490911e+01   5.880537e+02
## area_mean                1.221312e+03  4.912468e+02   8.413770e+03
## smoothness_mean          7.977424e-03 -1.011624e-03   6.755770e-02
## compactness_mean         9.368010e-02  5.442489e-02   7.112491e-01
## concavity_mean           1.894640e-01  1.046989e-01   1.382769e+00
## concave points_mean      1.121232e-01  4.951709e-02   7.999684e-01
## symmetry_mean            1.404566e-02  8.642084e-03   1.203885e-01
## fractal_dimension_mean  -7.811903e-03 -2.288540e-03  -4.524600e-02
## radius_se                6.646017e-01  3.297983e-01   4.668107e+00
## texture_se              -1.871470e-01  9.163532e-01  -1.148607e+00
## perimeter_se             4.808444e+00  2.456719e+00   3.408833e+01
## area_se                  1.179379e+02  5.112707e+01   8.232962e+02
## smoothness_se           -2.357825e-03  8.418747e-05  -1.480274e-02
## compactness_se           1.278925e-02  1.500579e-02   1.076835e-01
## concavity_se             2.033647e-02  1.893259e-02   1.651092e-01
## concave points_se        8.061288e-03  4.465472e-03   6.024435e-02
## symmetry_se             -2.974395e-03  2.674027e-04  -1.594145e-02
## fractal_dimension_se    -4.097096e-04  6.302523e-04  -4.345139e-04
## radius_worst             1.646623e+01  7.405394e+00   1.135286e+02
## texture_worst            6.497238e+00  2.410914e+01   4.570315e+01
## perimeter_worst          1.139490e+02  5.228339e+01   7.899822e+02
## area_worst               1.884673e+03  8.484444e+02   1.300149e+04
## smoothness_worst         9.164522e-03  8.017883e-03   8.037240e-02
## compactness_worst        2.275181e-01  1.901025e-01   1.730778e+00
## concavity_worst          3.850077e-01  2.731113e-01   2.842364e+00
## concave points_worst     1.714075e-01  8.470121e-02   1.225101e+00
## symmetry_worst           3.577533e-02  2.801888e-02   2.845696e-01
## fractal_dimension_worst  2.980312e-04  9.402207e-03   2.137503e-02
##                             area_mean smoothness_mean compactness_mean
## radius_mean              1.221312e+03    7.977424e-03     9.368010e-02
## texture_mean             4.912468e+02   -1.011624e-03     5.442489e-02
## perimeter_mean           8.413770e+03    6.755770e-02     7.112491e-01
## area_mean                1.236652e+05    8.411100e-01     9.230432e+00
## smoothness_mean          8.411100e-01    1.947699e-04     4.857460e-04
## compactness_mean         9.230432e+00    4.857460e-04     2.787592e-03
## concavity_mean           1.920452e+01    5.794142e-04     3.715166e-03
## concave points_mean      1.122083e+01    2.989204e-04     1.700989e-03
## symmetry_mean            1.443364e+00    2.136984e-04     8.716472e-04
## fractal_dimension_mean  -7.079805e-01    5.786523e-05     2.107603e-04
## radius_se                7.160073e+01    1.176340e-03     7.296583e-03
## texture_se              -1.271308e+01    5.479778e-04     1.371161e-03
## perimeter_se             5.176555e+02    8.409834e-03     5.868120e-02
## area_se                  1.281337e+04    1.563834e-01     1.094366e+00
## smoothness_se           -1.764078e-01    1.407167e-05     2.150829e-05
## compactness_se           1.324658e+00    7.886323e-05     6.976674e-04
## concavity_se             2.183140e+00    1.031759e-04     9.077057e-04
## concave points_se        8.000097e-01    3.218179e-05     2.083974e-04
## symmetry_se             -2.060626e-01    2.386345e-05     1.012450e-04
## fractal_dimension_se    -1.939842e-02    1.049477e-05     7.091231e-05
## radius_worst             1.634705e+03    1.398606e-02     1.361530e-01
## texture_worst            6.268507e+02    3.486180e-03     8.118962e-02
## perimeter_worst          1.132152e+04    1.093622e-01     1.044100e+00
## area_worst               1.920191e+05    1.610936e+00     1.528481e+01
## smoothness_worst         9.587572e-01    2.557838e-04     6.786073e-04
## compactness_worst        2.149580e+01    1.032656e-03     7.186655e-03
## concavity_worst          3.747290e+01    1.257345e-03     8.980402e-03
## concave points_worst     1.663529e+01    4.570223e-04     2.823966e-03
## symmetry_worst           3.128831e+00    3.434686e-04     1.669721e-03
## fractal_dimension_worst  1.244845e-02    1.260107e-04     6.553723e-04
##                         concavity_mean concave points_mean symmetry_mean
## radius_mean               1.894640e-01        1.121232e-01  1.404566e-02
## texture_mean              1.046989e-01        4.951709e-02  8.642084e-03
## perimeter_mean            1.382769e+00        7.999684e-01  1.203885e-01
## area_mean                 1.920452e+01        1.122083e+01  1.443364e+00
## smoothness_mean           5.794142e-04        2.989204e-04  2.136984e-04
## compactness_mean          3.715166e-03        1.700989e-03  8.716472e-04
## concavity_mean            6.352525e-03        2.847542e-03  1.092593e-03
## concave points_mean       2.847542e-03        1.504088e-03  4.909089e-04
## symmetry_mean             1.092593e-03        4.909089e-04  7.519769e-04
## fractal_dimension_mean    1.892722e-04        4.546765e-05  9.289783e-05
## radius_se                 1.399175e-02        7.522946e-03  2.309684e-03
## texture_se                3.390916e-03        4.788419e-04  1.948345e-03
## perimeter_se              1.065808e-01        5.582396e-02  1.741641e-02
## area_se                   2.239745e+00        1.218819e+00  2.789658e-01
## smoothness_se             2.365707e-05        3.240258e-06  1.545182e-05
## compactness_se            9.553362e-04        3.395918e-04  2.065482e-04
## concavity_se              1.661421e-03        5.125493e-04  2.827682e-04
## concave points_se         3.348335e-04        1.466384e-04  6.617709e-05
## symmetry_se               1.184893e-04        3.117593e-05  1.022081e-04
## fractal_dimension_se      9.478582e-05        2.640662e-05  2.407010e-05
## radius_worst              2.645798e-01        1.554065e-01  2.438192e-02
## texture_worst             1.479296e-01        7.034775e-02  1.548738e-02
## perimeter_worst           1.950250e+00        1.113827e+00  2.003430e-01
## area_worst                3.064051e+01        1.786553e+01  2.746316e+00
## smoothness_worst          8.117319e-04        3.981647e-04  2.658587e-04
## compactness_worst         9.456359e-03        4.065654e-03  2.037104e-03
## concavity_worst           1.468717e-02        6.078053e-03  2.474147e-03
## concave points_worst      4.503459e-03        2.315633e-03  7.722160e-04
## symmetry_worst            2.022598e-03        9.033578e-04  1.188916e-03
## fractal_dimension_worst   7.405979e-04        2.576523e-04  2.169195e-04
##                         fractal_dimension_mean    radius_se    texture_se
## radius_mean                      -7.811903e-03 6.646017e-01 -1.871470e-01
## texture_mean                     -2.288540e-03 3.297983e-01  9.163532e-01
## perimeter_mean                   -4.524600e-02 4.668107e+00 -1.148607e+00
## area_mean                        -7.079805e-01 7.160073e+01 -1.271308e+01
## smoothness_mean                   5.786523e-05 1.176340e-03  5.479778e-04
## compactness_mean                  2.107603e-04 7.296583e-03  1.371161e-03
## concavity_mean                    1.892722e-04 1.399175e-02  3.390916e-03
## concave points_mean               4.546765e-05 7.522946e-03  4.788419e-04
## symmetry_mean                     9.289783e-05 2.309684e-03  1.948345e-03
## fractal_dimension_mean            4.990897e-05 8.155118e-08  6.420351e-04
## radius_se                         8.155118e-08 7.703731e-02  3.268719e-02
## texture_se                        6.420351e-04 3.268719e-02  3.047739e-01
## perimeter_se                      5.673522e-04 5.463828e-01  2.494718e-01
## area_se                          -2.916037e-02 1.202801e+01  2.812626e+00
## smoothness_se                     8.537254e-06 1.372272e-04  6.590721e-04
## compactness_se                    7.076404e-05 1.770678e-03  2.300769e-03
## concavity_se                      9.513275e-05 2.785968e-03  3.264757e-03
## concave points_se                 1.480784e-05 8.795298e-04  7.896280e-04
## symmetry_se                       2.021486e-05 5.526559e-04  1.878035e-03
## fractal_dimension_se              1.287142e-05 1.673830e-04  4.094091e-04
## radius_worst                     -8.719987e-03 9.598717e-01 -2.957779e-01
## texture_worst                    -2.195928e-03 3.327686e-01  1.387449e+00
## perimeter_worst                  -4.909351e-02 6.716484e+00 -1.880635e+00
## area_worst                       -9.379469e-01 1.188502e+02 -2.594807e+01
## smoothness_worst                  8.124634e-05 8.987149e-04 -9.135689e-04
## compactness_worst                 5.092287e-04 1.254228e-02 -7.966557e-03
## concavity_worst                   5.089860e-04 2.204788e-02 -7.848439e-03
## concave points_worst              8.070968e-05 9.693555e-03 -4.303091e-03
## symmetry_worst                    1.461381e-04 1.624795e-03 -4.382460e-03
## fractal_dimension_worst           9.792278e-05 2.481958e-04 -4.506232e-04
##                         perimeter_se       area_se smoothness_se compactness_se
## radius_mean             4.808444e+00  1.179379e+02 -2.357825e-03   1.278925e-02
## texture_mean            2.456719e+00  5.112707e+01  8.418747e-05   1.500579e-02
## perimeter_mean          3.408833e+01  8.232962e+02 -1.480274e-02   1.076835e-01
## area_mean               5.176555e+02  1.281337e+04 -1.764078e-01   1.324658e+00
## smoothness_mean         8.409834e-03  1.563834e-01  1.407167e-05   7.886323e-05
## compactness_mean        5.868120e-02  1.094366e+00  2.150829e-05   6.976674e-04
## concavity_mean          1.065808e-01  2.239745e+00  2.365707e-05   9.553362e-04
## concave points_mean     5.582396e-02  1.218819e+00  3.240258e-06   3.395918e-04
## symmetry_mean           1.741641e-02  2.789658e-01  1.545182e-05   2.065482e-04
## fractal_dimension_mean  5.673522e-04 -2.916037e-02  8.537254e-06   7.076404e-05
## radius_se               5.463828e-01  1.202801e+01  1.372272e-04   1.770678e-03
## texture_se              2.494718e-01  2.812626e+00  6.590721e-04   2.300769e-03
## perimeter_se            4.094927e+00  8.638217e+01  9.188268e-04   1.508898e-02
## area_se                 8.638217e+01  2.072288e+03  1.028825e-02   2.316781e-01
## smoothness_se           9.188268e-04  1.028825e-02  9.030975e-06   1.814140e-05
## compactness_se          1.508898e-02  2.316781e-01  1.814140e-05   3.205028e-04
## concavity_se            2.214401e-02  3.714533e-01  2.440331e-05   4.327384e-04
## concave points_se       6.945238e-03  1.164564e-01  6.098430e-06   8.193121e-05
## symmetry_se             4.465255e-03  5.075277e-02  1.027967e-05   5.876332e-05
## fractal_dimension_se    1.307892e-03  1.528493e-02  3.401688e-06   3.809386e-05
## radius_worst            6.821309e+00  1.665616e+02 -3.351882e-03   1.749032e-02
## texture_worst           2.497005e+00  5.521291e+01 -1.382985e-03   1.594041e-02
## perimeter_worst         4.904584e+01  1.163852e+03 -2.195035e-02   1.552754e-01
## area_worst              8.423050e+02  2.103013e+04 -3.118533e-01   2.013895e+00
## smoothness_worst        5.990569e-03  1.288805e-01  2.160667e-05   9.158276e-05
## compactness_worst       1.088537e-01  2.023849e+00 -2.624262e-05   1.908929e-03
## concavity_worst         1.768535e-01  3.651061e+00 -3.651122e-05   2.382104e-03
## concave points_worst    7.381219e-02  1.607789e+00 -2.013785e-05   5.655949e-04
## symmetry_worst          1.377345e-02  2.088786e-01 -1.997406e-05   3.083044e-04
## fractal_dimension_worst 3.117648e-03  1.392871e-02  5.516455e-06   1.909854e-04
##                         concavity_se concave points_se   symmetry_se
## radius_mean             2.033647e-02      8.061288e-03 -2.974395e-03
## texture_mean            1.893259e-02      4.465472e-03  2.674027e-04
## perimeter_mean          1.651092e-01      6.024435e-02 -1.594145e-02
## area_mean               2.183140e+00      8.000097e-01 -2.060626e-01
## smoothness_mean         1.031759e-04      3.218179e-05  2.386345e-05
## compactness_mean        9.077057e-04      2.083974e-04  1.012450e-04
## concavity_mean          1.661421e-03      3.348335e-04  1.184893e-04
## concave points_mean     5.125493e-04      1.466384e-04  3.117593e-05
## symmetry_mean           2.827682e-04      6.617709e-05  1.022081e-04
## fractal_dimension_mean  9.513275e-05      1.480784e-05  2.021486e-05
## radius_se               2.785968e-03      8.795298e-04  5.526559e-04
## texture_se              3.264757e-03      7.896280e-04  1.878035e-03
## perimeter_se            2.214401e-02      6.945238e-03  4.465255e-03
## area_se                 3.714533e-01      1.164564e-01  5.075277e-02
## smoothness_se           2.440331e-05      6.098430e-06  1.027967e-05
## compactness_se          4.327384e-04      8.193121e-05  5.876332e-05
## concavity_se            9.110081e-04      1.433424e-04  7.769800e-05
## concave points_se       1.433424e-04      3.789372e-05  1.611135e-05
## symmetry_se             7.769800e-05      1.611135e-05  6.838511e-05
## fractal_dimension_se    5.814391e-05      9.973031e-06  8.098349e-06
## radius_worst            2.693275e-02      1.055707e-02 -5.053040e-03
## texture_worst           1.889507e-02      3.393192e-03 -3.994705e-03
## perimeter_worst         2.276234e-01      8.104002e-02 -2.834202e-02
## area_worst              3.208375e+00      1.191797e+00 -5.135231e-01
## smoothness_worst        1.139360e-04      2.950882e-05 -1.928150e-06
## compactness_worst       2.296133e-03      4.364883e-04  8.059110e-05
## concavity_worst         4.164549e-03      7.030495e-04  6.711721e-05
## concave points_worst    8.690687e-04      2.423883e-04 -1.529583e-05
## symmetry_worst          3.698595e-04      5.466730e-05  1.995319e-04
## fractal_dimension_worst 2.391802e-04      3.439887e-05  1.182683e-05
##                         fractal_dimension_se  radius_worst texture_worst
## radius_mean                    -4.097096e-04  1.646623e+01  6.497238e+00
## texture_mean                    6.302523e-04  7.405394e+00  2.410914e+01
## perimeter_mean                 -4.345139e-04  1.135286e+02  4.570315e+01
## area_mean                      -1.939842e-02  1.634705e+03  6.268507e+02
## smoothness_mean                 1.049477e-05  1.398606e-02  3.486180e-03
## compactness_mean                7.091231e-05  1.361530e-01  8.118962e-02
## concavity_mean                  9.478582e-05  2.645798e-01  1.479296e-01
## concave points_mean             2.640662e-05  1.554065e-01  7.034775e-02
## symmetry_mean                   2.407010e-05  2.438192e-02  1.548738e-02
## fractal_dimension_mean          1.287142e-05 -8.719987e-03 -2.195928e-03
## radius_se                       1.673830e-04  9.598717e-01  3.327686e-01
## texture_se                      4.094091e-04 -2.957779e-01  1.387449e+00
## perimeter_se                    1.307892e-03  6.821309e+00  2.497005e+00
## area_se                         1.528493e-02  1.665616e+02  5.521291e+01
## smoothness_se                   3.401688e-06 -3.351882e-03 -1.382985e-03
## compactness_se                  3.809386e-05  1.749032e-02  1.594041e-02
## concavity_se                    5.814391e-05  2.693275e-02  1.889507e-02
## concave points_se               9.973031e-06  1.055707e-02  3.393192e-03
## symmetry_se                     8.098349e-06 -5.053040e-03 -3.994705e-03
## fractal_dimension_se            7.012231e-06 -4.924591e-04 -4.366402e-05
## radius_worst                   -4.924591e-04  2.331941e+01  1.076728e+01
## texture_worst                  -4.366402e-05  1.076728e+01  3.780420e+01
## perimeter_worst                -1.751025e-04  1.610929e+02  7.593550e+01
## area_worst                     -3.540787e-02  2.705260e+03  1.217454e+03
## smoothness_worst                1.024745e-05  2.343166e-02  3.204282e-02
## compactness_worst               1.623794e-04  3.601883e-01  3.511252e-01
## concavity_worst                 2.096425e-04  5.765011e-01  4.754297e-01
## concave points_worst            3.729208e-05  2.492264e-01  1.465503e-01
## symmetry_worst                  1.821347e-05  7.291282e-02  8.879043e-02
## fractal_dimension_worst         2.828600e-05  8.012550e-03  2.447990e-02
##                         perimeter_worst    area_worst smoothness_worst
## radius_mean                1.139490e+02  1.884673e+03     9.164522e-03
## texture_mean               5.228339e+01  8.484444e+02     8.017883e-03
## perimeter_mean             7.899822e+02  1.300149e+04     8.037240e-02
## area_mean                  1.132152e+04  1.920191e+05     9.587572e-01
## smoothness_mean            1.093622e-01  1.610936e+00     2.557838e-04
## compactness_mean           1.044100e+00  1.528481e+01     6.786073e-04
## concavity_mean             1.950250e+00  3.064051e+01     8.117319e-04
## concave points_mean        1.113827e+00  1.786553e+01     3.981647e-04
## symmetry_mean              2.003430e-01  2.746316e+00     2.658587e-04
## fractal_dimension_mean    -4.909351e-02 -9.379469e-01     8.124634e-05
## radius_se                  6.716484e+00  1.188502e+02     8.987149e-04
## texture_se                -1.880635e+00 -2.594807e+01    -9.135689e-04
## perimeter_se               4.904584e+01  8.423050e+02     5.990569e-03
## area_se                    1.163852e+03  2.103013e+04     1.288805e-01
## smoothness_se             -2.195035e-02 -3.118533e-01     2.160667e-05
## compactness_se             1.552754e-01  2.013895e+00     9.158276e-05
## concavity_se               2.276234e-01  3.208375e+00     1.139360e-04
## concave points_se          8.104002e-02  1.191797e+00     2.950882e-05
## symmetry_se               -2.834202e-02 -5.135231e-01    -1.928150e-06
## fractal_dimension_se      -1.751025e-04 -3.540787e-02     1.024745e-05
## radius_worst               1.610929e+02  2.705260e+03     2.343166e-02
## texture_worst              7.593550e+01  1.217454e+03     3.204282e-02
## perimeter_worst            1.127034e+03  1.868385e+04     1.783764e-01
## area_worst                 1.868385e+04  3.240774e+05     2.677790e+00
## smoothness_worst           1.783764e-01  2.677790e+00     5.190617e-04
## compactness_worst          2.787733e+00  3.912686e+01     2.030516e-03
## concavity_worst            4.319295e+00  6.435724e+01     2.453895e-03
## concave points_worst       1.796511e+00  2.789769e+01     8.148531e-04
## symmetry_worst             5.609849e-01  7.376858e+00     6.985958e-04
## fractal_dimension_worst    8.333025e-02  8.058225e-01     2.541310e-04
##                         compactness_worst concavity_worst concave points_worst
## radius_mean                  2.275181e-01    3.850077e-01         1.714075e-01
## texture_mean                 1.901025e-01    2.731113e-01         8.470121e-02
## perimeter_mean               1.730778e+00    2.842364e+00         1.225101e+00
## area_mean                    2.149580e+01    3.747290e+01         1.663529e+01
## smoothness_mean              1.032656e-03    1.257345e-03         4.570223e-04
## compactness_mean             7.186655e-03    8.980402e-03         2.823966e-03
## concavity_mean               9.456359e-03    1.468717e-02         4.503459e-03
## concave points_mean          4.065654e-03    6.078053e-03         2.315633e-03
## symmetry_mean                2.037104e-03    2.474147e-03         7.722160e-04
## fractal_dimension_mean       5.092287e-04    5.089860e-04         8.070968e-05
## radius_se                    1.254228e-02    2.204788e-02         9.693555e-03
## texture_se                  -7.966557e-03   -7.848439e-03        -4.303091e-03
## perimeter_se                 1.088537e-01    1.768535e-01         7.381219e-02
## area_se                      2.023849e+00    3.651061e+00         1.607789e+00
## smoothness_se               -2.624262e-05   -3.651122e-05        -2.013785e-05
## compactness_se               1.908929e-03    2.382104e-03         5.655949e-04
## concavity_se                 2.296133e-03    4.164549e-03         8.690687e-04
## concave points_se            4.364883e-04    7.030495e-04         2.423883e-04
## symmetry_se                  8.059110e-05    6.711721e-05        -1.529583e-05
## fractal_dimension_se         1.623794e-04    2.096425e-04         3.729208e-05
## radius_worst                 3.601883e-01    5.765011e-01         2.492264e-01
## texture_worst                3.511252e-01    4.754297e-01         1.465503e-01
## perimeter_worst              2.787733e+00    4.319295e+00         1.796511e+00
## area_worst                   3.912686e+01    6.435724e+01         2.789769e+01
## smoothness_worst             2.030516e-03    2.453895e-03         8.148531e-04
## compactness_worst            2.473477e-02    2.924813e-02         8.261025e-03
## concavity_worst              2.924813e-02    4.346996e-02         1.169645e-02
## concave points_worst         8.261025e-03    1.169645e-02         4.305155e-03
## symmetry_worst               5.990521e-03    6.883953e-03         2.046631e-03
## fractal_dimension_worst      2.302582e-03    2.584829e-03         6.051249e-04
##                         symmetry_worst fractal_dimension_worst
## radius_mean               3.577533e-02            2.980312e-04
## texture_mean              2.801888e-02            9.402207e-03
## perimeter_mean            2.845696e-01            2.137503e-02
## area_mean                 3.128831e+00            1.244845e-02
## smoothness_mean           3.434686e-04            1.260107e-04
## compactness_mean          1.669721e-03            6.553723e-04
## concavity_mean            2.022598e-03            7.405979e-04
## concave points_mean       9.033578e-04            2.576523e-04
## symmetry_mean             1.188916e-03            2.169195e-04
## fractal_dimension_mean    1.461381e-04            9.792278e-05
## radius_se                 1.624795e-03            2.481958e-04
## texture_se               -4.382460e-03           -4.506232e-04
## perimeter_se              1.377345e-02            3.117648e-03
## area_se                   2.088786e-01            1.392871e-02
## smoothness_se            -1.997406e-05            5.516455e-06
## compactness_se            3.083044e-04            1.909854e-04
## concavity_se              3.698595e-04            2.391802e-04
## concave points_se         5.466730e-05            3.439887e-05
## symmetry_se               1.995319e-04            1.182683e-05
## fractal_dimension_se      1.821347e-05            2.828600e-05
## radius_worst              7.291282e-02            8.012550e-03
## texture_worst             8.879043e-02            2.447990e-02
## perimeter_worst           5.609849e-01            8.333025e-02
## area_worst                7.376858e+00            8.058225e-01
## smoothness_worst          6.985958e-04            2.541310e-04
## compactness_worst         5.990521e-03            2.302582e-03
## concavity_worst           6.883953e-03            2.584829e-03
## concave points_worst      2.046631e-03            6.051249e-04
## symmetry_worst            3.834318e-03            6.019829e-04
## fractal_dimension_worst   6.019829e-04            3.264600e-04

Step 3: Computing the eigenvectors and eigenvalues

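Eigenvectors and eigenvalues are computed from the covariance matrix of the previous step. An eigenvector is a nonzero vector whose direction does not change when the matrix (a linear transformation) is applied to it; the matching eigenvalue is the factor by which the vector is stretched. In PCA, the eigenvectors give the directions of the principal components, and the eigenvalues give the amount of variance along each direction.

Ordering the components by eigenvalue, from largest to smallest, lets us keep only the first few and so reduce complexity (dimensionality) without losing much information.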
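
As a minimal sketch of this relationship, the example below runs an eigen decomposition on a covariance matrix and checks it against prcomp(). It uses the built-in iris data as a stand-in (an assumption; it is not one of the data sets in this tutorial), and the object names are only illustrative.

```r
# Stand-in data (assumption): the four numeric columns of the built-in iris data.
X <- as.matrix(iris[, 1:4])

# Covariance matrix of the variables.
S <- cov(X)

# Eigen decomposition: the columns of e$vectors are the eigenvectors and
# e$values holds the matching eigenvalues, sorted from largest to smallest.
e <- eigen(S)

# Check the defining property S v = lambda v for the first pair.
v1 <- e$vectors[, 1]
lambda1 <- e$values[1]
all.equal(as.vector(S %*% v1), lambda1 * v1)

# The eigenvalues equal the component variances reported by prcomp().
pca <- prcomp(X, center = TRUE, scale. = FALSE)
all.equal(e$values, pca$sdev^2)
```

Both checks should return TRUE, which is why the code in this step can read the eigenvalues directly from the squared standard deviations of a fitted PCA object.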

Computationally, this looks like this:

After creating the biplots in the previous step, we can see that area_worst and area_mean have much larger covariances than the other variables. This is a scale issue: the values of these variables are numerically much larger than the rest, which tells us that we need to scale the variables before running PCA. Let’s calculate the eigenvectors and eigenvalues in the following steps.

par(cex = cex.before)
par(mfrow = c(1, 2)) #Set up a grid for PCA analysis
pr.cvar <- wbg.pcov$sdev ^ 2 #Calculate the variance of each principal component (the eigenvalues)
pve_cov_wbg <- pr.cvar/sum(pr.cvar) #Calculate the proportion of variance explained by each principal component

#We now print the eigenvalues and the proportion of variance explained by each component
round(pr.cvar, 2) #Eigenvalues
##  [1] 113056.40  42030.38  17837.52   1519.88     14.16      8.25      4.66
##  [8]      2.16      0.76      0.20      0.15      0.02      0.00
round(pve_cov_wbg, 2) # Percent Variance
##  [1] 0.65 0.24 0.10 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
#use the prcomp() function to calculate the eigenvectors
wbg.pr <- prcomp(wbg, scale = TRUE, center = TRUE)
summary(wbg.pr)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.1711 1.6378 1.1716 1.10492 1.00608 0.83074 0.71444
## Proportion of Variance 0.3626 0.2063 0.1056 0.09391 0.07786 0.05309 0.03926
## Cumulative Proportion  0.3626 0.5689 0.6745 0.76842 0.84629 0.89937 0.93864
##                            PC8     PC9    PC10    PC11    PC12    PC13
## Standard deviation     0.65762 0.52192 0.24515 0.14647 0.09785 0.04161
## Proportion of Variance 0.03327 0.02095 0.00462 0.00165 0.00074 0.00013
## Cumulative Proportion  0.97190 0.99286 0.99748 0.99913 0.99987 1.00000
# Next: the wdbc dataset

par(cex = cex.before)
par(mfrow = c(1, 2)) #Set up a grid for PCA analysis
pr.cvar <- wdbc.pcov$sdev ^ 2 #Calculate the variance of each principal component (the eigenvalues)
pve_cov <- pr.cvar/sum(pr.cvar) #Calculate the proportion of variance explained by each principal component

# We now print the eigenvalues and the proportion of variance explained by each component
round(pr.cvar, 2) #Eigenvalues
##    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7    Comp.8 
## 442732.81   7296.21    702.82     54.64     39.51      2.97      1.80      0.37 
##    Comp.9   Comp.10   Comp.11   Comp.12   Comp.13   Comp.14   Comp.15   Comp.16 
##      0.16      0.08      0.03      0.01      0.00      0.00      0.00      0.00 
##   Comp.17   Comp.18   Comp.19   Comp.20   Comp.21   Comp.22   Comp.23   Comp.24 
##      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00 
##   Comp.25   Comp.26   Comp.27   Comp.28   Comp.29   Comp.30 
##      0.00      0.00      0.00      0.00      0.00      0.00
round(pve_cov, 2) # Percent Variance
##  Comp.1  Comp.2  Comp.3  Comp.4  Comp.5  Comp.6  Comp.7  Comp.8  Comp.9 Comp.10 
##    0.98    0.02    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00 
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16 Comp.17 Comp.18 Comp.19 Comp.20 
##    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00 
## Comp.21 Comp.22 Comp.23 Comp.24 Comp.25 Comp.26 Comp.27 Comp.28 Comp.29 Comp.30 
##    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
#use the prcomp() function to calculate the eigenvectors
wdbc.pr <- prcomp(wdbc.data, scale = TRUE, center = TRUE)
summary(wdbc.pr)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5    PC6     PC7
## Standard deviation     3.6430 2.3887 1.67894 1.40544 1.28662 1.0982 0.81949
## Proportion of Variance 0.4424 0.1902 0.09396 0.06584 0.05518 0.0402 0.02239
## Cumulative Proportion  0.4424 0.6326 0.72653 0.79237 0.84755 0.8878 0.91013
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.68973 0.64618 0.59266 0.54282 0.51175 0.49126 0.39418
## Proportion of Variance 0.01586 0.01392 0.01171 0.00982 0.00873 0.00804 0.00518
## Cumulative Proportion  0.92599 0.93991 0.95162 0.96144 0.97017 0.97821 0.98339
##                           PC15    PC16    PC17    PC18    PC19    PC20   PC21
## Standard deviation     0.30696 0.28022 0.24367 0.22980 0.22256 0.17656 0.1729
## Proportion of Variance 0.00314 0.00262 0.00198 0.00176 0.00165 0.00104 0.0010
## Cumulative Proportion  0.98653 0.98915 0.99113 0.99289 0.99454 0.99558 0.9966
##                           PC22    PC23   PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.16547 0.15629 0.1344 0.12458 0.08929 0.08295 0.03993
## Proportion of Variance 0.00091 0.00081 0.0006 0.00052 0.00027 0.00023 0.00005
## Cumulative Proportion  0.99749 0.99830 0.9989 0.99942 0.99969 0.99992 0.99997
##                           PC29    PC30
## Standard deviation     0.02728 0.01153
## Proportion of Variance 0.00002 0.00000
## Cumulative Proportion  1.00000 1.00000
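
To illustrate why the scaled and unscaled results above differ so much, here is a minimal sketch that runs PCA both ways on the same data. It uses the built-in USArrests data as a stand-in (an assumption; it is not one of the tutorial’s data sets), chosen because one of its variables, Assault, has a far larger variance than the others.

```r
# Stand-in data (assumption): USArrests, where Assault has a much larger
# variance than the other three variables.
X <- USArrests

# PCA on the covariance matrix (no scaling) versus the correlation matrix (scaled).
pca_cov <- prcomp(X, center = TRUE, scale. = FALSE)
pca_cor <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component.
pve_unscaled <- pca_cov$sdev^2 / sum(pca_cov$sdev^2)
pve_scaled   <- pca_cor$sdev^2 / sum(pca_cor$sdev^2)

round(pve_unscaled, 2)
round(pve_scaled, 2)

# Absolute loadings on PC1: without scaling, the high-variance variable dominates.
round(abs(pca_cov$rotation[, 1]), 2)
round(abs(pca_cor$rotation[, 1]), 2)
```

Without scaling, the first component mostly tracks the variable with the largest variance, which is the same pattern seen with area_mean and area_worst in the wdbc data.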

Step 4: Feature vector extraction

We can now use these metrics to calculate the cumulative percentages. In PCA, the cumulative percentage for a component is the running total of the proportion of variance explained by that component and all the components before it, so the last value will always be equal to 100%. It works like a cumulative percentage in a frequency distribution, which divides the cumulative frequency by the total number of observations (n) and multiplies by 100. Its main advantage is the same: because everything is expressed as a share of the total, it is easier to compare different sets of data.

It is useful in PCA to create a scree plot of the proportion of variance explained and of its cumulative sum. In multivariate statistics, a scree plot is a line plot of the eigenvalues of the factors or principal components in an analysis. It is used to decide how many factors to retain in an exploratory factor analysis (FA) or how many principal components to keep in a principal component analysis (PCA).
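
The calculation can be sketched by hand as follows. The eigenvalues are the rounded values printed for the German weather data in Step 3; the variable names (eig, pve, cum_pve) are illustrative only.

```r
# Eigenvalues copied from the rounded Step 3 output for the German weather data.
eig <- c(113056.40, 42030.38, 17837.52, 1519.88, 14.16, 8.25, 4.66,
         2.16, 0.76, 0.20, 0.15, 0.02, 0.00)

# Proportion of variance explained: each eigenvalue divided by their total.
pve <- eig / sum(eig)

# Cumulative proportion: running total of the proportions.
cum_pve <- cumsum(pve)

round(pve, 2)
round(cum_pve, 2)
```

Because the inputs are already rounded, this is only an approximation of what the full-precision objects give, but the first few cumulative values agree with the output shown for this dataset.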

We can begin this step by looking at the German dataset.

round(cumsum(pve_cov_wbg), 2) #Cumulative percentage
##  [1] 0.65 0.89 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
# Plot variance explained for each principal component
plot(pve_cov_wbg, xlab = "Principal Component", 
     ylab = "Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

# Plot cumulative proportion of variance explained
plot(cumsum(pve_cov_wbg), xlab = "Principal Component", 
     ylab = "Cumulative Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

Now the Wisconsin dataset.

round(cumsum(pve_cov), 2) #Cumulative percentage
##  Comp.1  Comp.2  Comp.3  Comp.4  Comp.5  Comp.6  Comp.7  Comp.8  Comp.9 Comp.10 
##    0.98    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00 
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16 Comp.17 Comp.18 Comp.19 Comp.20 
##    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00 
## Comp.21 Comp.22 Comp.23 Comp.24 Comp.25 Comp.26 Comp.27 Comp.28 Comp.29 Comp.30 
##    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
# Plot variance explained for each principal component
plot(pve_cov, xlab = "Principal Component", 
     ylab = "Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

# Plot cumulative proportion of variance explained
plot(cumsum(pve_cov), xlab = "Principal Component", 
     ylab = "Cumulative Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

In the scree plots, the first two or three PCs capture most of the information. A scree plot shows how much variation each PC captures from the data; the y-axis represents the eigenvalues, shown here as the proportion of variance explained. A scree plot is used to select which principal components to keep. An ideal curve starts steep, bends at an “elbow”, and finally flattens out. Our covariance-based PCA shows this trend for the German weather data, but not for the Wisconsin breast cancer dataset, where the first PC alone accounts for 98% of the variance. Therefore, we will move to a correlation matrix to see if PC selection can be improved for the Wisconsin breast cancer dataset.

Step 5: Recasting the data along the principal components axes

By using the selected feature vector we can finally reorient the WBC data using the axes from the principal components.
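
As a minimal sketch of the recasting itself, the example below multiplies standardized data by the selected eigenvectors. It uses the built-in iris data as a stand-in (an assumption), and k = 2 is an arbitrary choice for illustration.

```r
# Stand-in data (assumption): the four numeric columns of the built-in iris data.
X <- as.matrix(iris[, 1:4])
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Standardize the data the same way prcomp() did.
Z <- scale(X, center = pca$center, scale = pca$scale)

# Feature vector: keep the first k eigenvectors (k = 2 is arbitrary here).
k <- 2
feature_vector <- pca$rotation[, 1:k]

# Recast the data: each row becomes its coordinates on the first k PC axes.
scores <- Z %*% feature_vector

# These match the scores that prcomp() already stores in pca$x.
all.equal(scores, pca$x[, 1:k], check.attributes = FALSE)
```

In practice the projected data can simply be read from the x element of a prcomp object; the manual product is shown only to make the step explicit.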

# Set up 1 x 2 plotting grid
par(mfrow = c(1, 2))

# Calculate variability of each component
pr.var <- wdbc.pr$sdev ^ 2

# Assign names to the columns to be consistent with princomp.
# This is done for reporting purposes.
names(pr.var) <- names(pr.cvar)

# Variance explained by each principal component: pve
pve <- pr.var/sum(pr.var)

# Assign names to the columns as it is not done by default.
# This is done to be consistent with princomp.
names(pve) <- names(pve_cov)

round(pr.var, 2) #Eigen values 
##  Comp.1  Comp.2  Comp.3  Comp.4  Comp.5  Comp.6  Comp.7  Comp.8  Comp.9 Comp.10 
##   13.27    5.71    2.82    1.98    1.66    1.21    0.67    0.48    0.42    0.35 
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16 Comp.17 Comp.18 Comp.19 Comp.20 
##    0.29    0.26    0.24    0.16    0.09    0.08    0.06    0.05    0.05    0.03 
## Comp.21 Comp.22 Comp.23 Comp.24 Comp.25 Comp.26 Comp.27 Comp.28 Comp.29 Comp.30 
##    0.03    0.03    0.02    0.02    0.02    0.01    0.01    0.00    0.00    0.00
round(pve, 2) #Percent variance explained
##  Comp.1  Comp.2  Comp.3  Comp.4  Comp.5  Comp.6  Comp.7  Comp.8  Comp.9 Comp.10 
##    0.44    0.19    0.09    0.07    0.06    0.04    0.02    0.02    0.01    0.01 
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16 Comp.17 Comp.18 Comp.19 Comp.20 
##    0.01    0.01    0.01    0.01    0.00    0.00    0.00    0.00    0.00    0.00 
## Comp.21 Comp.22 Comp.23 Comp.24 Comp.25 Comp.26 Comp.27 Comp.28 Comp.29 Comp.30 
##    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
round(cumsum(pve), 2) # Cumulative proportion explained
##  Comp.1  Comp.2  Comp.3  Comp.4  Comp.5  Comp.6  Comp.7  Comp.8  Comp.9 Comp.10 
##    0.44    0.63    0.73    0.79    0.85    0.89    0.91    0.93    0.94    0.95 
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16 Comp.17 Comp.18 Comp.19 Comp.20 
##    0.96    0.97    0.98    0.98    0.99    0.99    0.99    0.99    0.99    1.00 
## Comp.21 Comp.22 Comp.23 Comp.24 Comp.25 Comp.26 Comp.27 Comp.28 Comp.29 Comp.30 
##    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
plot(pve, xlab = "Principal Component", 
     ylab = "Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component", 
     ylab = "Cumulative Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

The first six PCs explain most of the variation (about 89%), and the eigenvalues associated with them are all greater than 1. This is the selection criterion we will use for the rest of the WBC data PCA.
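
The eigenvalue-greater-than-1 rule can be sketched directly from the printed values. The vector below copies the first ten rounded eigenvalues from the output above; n_vars = 30 is the number of variables in the wdbc data, and for standardized data the eigenvalues sum to that number, so an eigenvalue of 1 corresponds to the variance of a single original variable.

```r
# First ten eigenvalues copied from the rounded output above.
eig <- c(13.27, 5.71, 2.82, 1.98, 1.66, 1.21, 0.67, 0.48, 0.42, 0.35)

# Number of variables; with standardized data the eigenvalues sum to this.
n_vars <- 30

# Keep the components whose eigenvalue is greater than 1.
keep <- which(eig > 1)
length(keep)

# Share of the total variance captured by the retained components.
round(sum(eig[keep]) / n_vars, 2)
```

This is a rule of thumb rather than a strict test, so it is worth comparing its answer with the scree plot before settling on a number of components.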

Another way to determine the number of components to use in PCA is through the preProcess() function in the caret package. By default it keeps enough components to explain 95% of the variance, but this can be changed using the thresh or pcaComp parameter.

preProc <- preProcess(wbg,method="pca",thresh = 0.95) #Can substitute pcaComp for thresh to set the number of components directly
predictPC <- predict(preProc, wbg)
plot(predictPC[,3],predictPC[,4],col=blues9)

This result can also be seen in the graph above, where most of the points are clustered around 0 on the y-axis. This indicates that the variables can probably be combined while retaining most of the information found in each variable separately. By changing which columns are plotted (the indices in the brackets), we can better understand the relationship between the variables, which assists in preprocessing for dimensionality reduction.

Sub in different variables:

preProc2 <- preProcess(wbg,method="pca",thresh = 0.95) #Can substitute pcaComp for thresh to set the number of components directly
predictPC2 <- predict(preProc2, wbg)
plot(predictPC2[,4],predictPC2[,7],col=blues9)
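
The idea behind a variance threshold can be approximated in base R. The sketch below is not caret’s implementation: n_components_for() is a hypothetical helper written for illustration, and the standard deviations are the rounded values reported by summary(wbg.pr) in Step 3.

```r
# Hypothetical helper (not part of caret): smallest number of components whose
# cumulative proportion of variance reaches the threshold.
n_components_for <- function(sdev, thresh = 0.95) {
  pve <- sdev^2 / sum(sdev^2)
  which(cumsum(pve) >= thresh)[1]
}

# Standard deviations copied from the rounded summary(wbg.pr) output in Step 3.
sdev_wbg <- c(2.1711, 1.6378, 1.1716, 1.10492, 1.00608, 0.83074, 0.71444,
              0.65762, 0.52192, 0.24515, 0.14647, 0.09785, 0.04161)

# Components needed for 95% and for 80% of the variance.
n_components_for(sdev_wbg, thresh = 0.95)
n_components_for(sdev_wbg, thresh = 0.80)
```

Raising or lowering thresh trades off how much variance is kept against how many components have to be carried forward.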

PCA in fewer steps

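When working in R, we can skip a lot of these steps by running the prcomp() function and setting the arguments center and scale to TRUE. Then summary() reports the standard deviation, proportion of variance, and cumulative proportion for each component, which we can use to choose how many components to keep.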

wbg.pca<-prcomp(wbg, center=TRUE, scale =TRUE)

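# Note: princomp() has no center or scale arguments, so they have no effect here and
# the output below is based on the unscaled covariance matrix. Use cor = TRUE to standardize.
wdbc.pca<-princomp(wdbc.data, center=TRUE, scale =TRUE)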

wbg.pca
## Standard deviations (1, .., p=13):
##  [1] 2.17105943 1.63781730 1.17162926 1.10492374 1.00608224 0.83074079
##  [7] 0.71444465 0.65761504 0.52192250 0.24515367 0.14647325 0.09784617
## [13] 0.04161354
## 
## Rotation (n x k) = (13 x 13):
##                                PC1         PC2         PC3         PC4
## LAT                     0.20502759 -0.45993151 -0.14883206 -0.12248527
## LON                    -0.10401828 -0.01594743 -0.73055743 -0.12278093
## ALTITUDE               -0.42758109  0.11298066  0.06009671  0.06273283
## RECORD.LENGTH           0.04067894 -0.03117858 -0.14533064  0.02692803
## MEAN.ANNUAL.AIR.TEMP    0.42393848  0.13805683  0.16674865  0.11650671
## MEAN.MONTHLY.MAX.TEMP   0.36582631  0.34459688  0.05712672  0.02664235
## MEAN.MONTHLY.MIN.TEMP   0.39245789 -0.08381029  0.25490700  0.25722911
## MEAN.ANNUAL.RAINFALL   -0.36811710  0.07858829  0.34376283  0.17309381
## MEAN.ANNUAL.SUNSHINE   -0.01921069  0.01176214 -0.30245296  0.69344905
## MEAN.ANNUAL.WIND.SPEED -0.05562375 -0.50547309 -0.01001127  0.21256393
## MEAN.RANGE.AIR.TEMP     0.05251610  0.53325124 -0.19910592 -0.24175182
## MEAN.CLOUD.COVER       -0.11003092 -0.23633688  0.23935224 -0.48180107
## MAX.RAINFALL           -0.37583188  0.16616730  0.12912218  0.19549566
##                                 PC5         PC6         PC7          PC8
## LAT                     0.133063365 -0.10411263  0.02133133  0.572509801
## LON                     0.171401184 -0.09549211 -0.52716324 -0.092655671
## ALTITUDE               -0.064344175  0.02243782  0.14030639 -0.344128929
## RECORD.LENGTH          -0.949843888 -0.23719866 -0.09034671  0.089655470
## MEAN.ANNUAL.AIR.TEMP    0.022723368  0.01022428 -0.26359329 -0.097581198
## MEAN.MONTHLY.MAX.TEMP  -0.002408029  0.05550861 -0.22721810 -0.081824847
## MEAN.MONTHLY.MIN.TEMP   0.016759352 -0.05625575 -0.32164108 -0.093739569
## MEAN.ANNUAL.RAINFALL    0.034749214 -0.13704659 -0.23410563  0.377119294
## MEAN.ANNUAL.SUNSHINE   -0.077133928  0.61282883  0.08250589  0.194513602
## MEAN.ANNUAL.WIND.SPEED  0.002800018 -0.06962849 -0.12796631 -0.554985004
## MEAN.RANGE.AIR.TEMP    -0.024823742  0.13396876  0.04443454 -0.003703994
## MEAN.CLOUD.COVER       -0.186404458  0.69185025 -0.35748300 -0.004475645
## MAX.RAINFALL            0.056932708 -0.14536586 -0.51134767  0.155188656
##                                 PC9         PC10         PC11         PC12
## LAT                     0.204395502 -0.084580122 -0.550912309  0.004442119
## LON                    -0.203936888  0.268528888  0.025347358  0.028508656
## ALTITUDE               -0.234686588  0.127321765 -0.761672693  0.050518927
## RECORD.LENGTH           0.006676567  0.011437052 -0.019642642  0.017161328
## MEAN.ANNUAL.AIR.TEMP   -0.006669925  0.039153700 -0.138682827  0.811801072
## MEAN.MONTHLY.MAX.TEMP   0.191809126  0.086969234 -0.231561151 -0.407190241
## MEAN.MONTHLY.MIN.TEMP  -0.290881473  0.098544221 -0.124395317 -0.410346533
## MEAN.ANNUAL.RAINFALL    0.246445156  0.652262806  0.074819635  0.036946986
## MEAN.ANNUAL.SUNSHINE    0.026626053  0.003346446 -0.009076770  0.013742342
## MEAN.ANNUAL.WIND.SPEED  0.605345930  0.009253004  0.011956023 -0.014291312
## MEAN.RANGE.AIR.TEMP     0.553793589 -0.012989741 -0.137749087 -0.035443392
## MEAN.CLOUD.COVER       -0.033646560  0.011807013 -0.009814602 -0.005053573
## MAX.RAINFALL            0.065994746 -0.678039420 -0.045824409 -0.008470810
##                                 PC13
## LAT                     0.0070603226
## LON                    -0.0002249761
## ALTITUDE                0.0110655936
## RECORD.LENGTH          -0.0020078827
## MEAN.ANNUAL.AIR.TEMP   -0.0195381409
## MEAN.MONTHLY.MAX.TEMP  -0.6467314720
## MEAN.MONTHLY.MIN.TEMP   0.5603044697
## MEAN.ANNUAL.RAINFALL    0.0045395157
## MEAN.ANNUAL.SUNSHINE   -0.0005152601
## MEAN.ANNUAL.WIND.SPEED  0.0017637039
## MEAN.RANGE.AIR.TEMP     0.5168742141
## MEAN.CLOUD.COVER       -0.0023303971
## MAX.RAINFALL           -0.0071288448
wdbc.pca
## Call:
## princomp(x = wdbc.data, center = TRUE, scale = TRUE)
## 
## Standard deviations:
##       Comp.1       Comp.2       Comp.3       Comp.4       Comp.5       Comp.6 
## 6.653817e+02 8.541786e+01 2.651082e+01 7.391720e+00 6.285379e+00 1.723376e+00 
##       Comp.7       Comp.8       Comp.9      Comp.10      Comp.11      Comp.12 
## 1.341902e+00 6.094794e-01 3.939371e-01 2.896455e-01 1.776972e-01 8.644681e-02 
##      Comp.13      Comp.14      Comp.15      Comp.16      Comp.17      Comp.18 
## 5.622228e-02 4.648825e-02 3.642125e-02 2.526125e-02 1.935774e-02 1.528887e-02 
##      Comp.19      Comp.20      Comp.21      Comp.22      Comp.23      Comp.24 
## 1.357492e-02 1.271995e-02 8.801581e-03 7.579317e-03 5.909075e-03 5.305210e-03 
##      Comp.25      Comp.26      Comp.27      Comp.28      Comp.29      Comp.30 
## 3.978787e-03 3.530384e-03 1.917828e-03 1.675896e-03 1.415907e-03 8.352584e-04 
## 
##  30  variables and  568 observations.
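
The two functions take different arguments for standardization, which the following sketch illustrates. It uses the built-in iris data as a stand-in (an assumption); prcomp() standardizes with scale. = TRUE, whereas princomp() does so with cor = TRUE.

```r
# Stand-in data (assumption): the four numeric columns of the built-in iris data.
X <- iris[, 1:4]
n <- nrow(X)

# Standardized PCA with each function.
pc_prcomp   <- prcomp(X, center = TRUE, scale. = TRUE)
pc_princomp <- princomp(X, cor = TRUE)

# With standardized variables, both report the same standard deviations.
all.equal(unname(pc_princomp$sdev), pc_prcomp$sdev)

# Without standardization, princomp() divides by n and prcomp() by n - 1,
# so the standard deviations differ by a factor of sqrt(n / (n - 1)).
pc_cov <- princomp(X)
all.equal(unname(pc_cov$sdev) * sqrt(n / (n - 1)), prcomp(X)$sdev)
```

Both comparisons should return TRUE. The very large standard deviations in the princomp() output above are what an unstandardized fit looks like when a few variables have much larger values than the rest.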

Visualizing PCA

Let’s load ggfortify and install ggbiplot from GitHub to explore some ways in which we can visualize PCAs.

library(devtools)
library(ggfortify)

install_github("vqv/ggbiplot", force = TRUE)
library(ggbiplot)

ggbiplot(wbg.pca)

ggbiplot(wdbc.pca)

#let's make it more aesthetic

wbg_plot<- ggbiplot(wbg.pca, obs.scale = 1, var.scale = 1, ellipse = TRUE, circle = TRUE)+theme_minimal()

wbg_plot+labs(title = "PCA of German weather data")

wdbc_plot<- ggbiplot(wdbc.pca, obs.scale = 1, var.scale = 1, ellipse = TRUE, circle = TRUE)+theme_minimal()   

wdbc_plot+labs(title = "PCA of Wisconsin breast cancer data")

Another way to visualize PCA is by using the factoextra package

The FactoMineR package can do PCA for you, and the factoextra package is useful in extracting components of the PCA analysis done by the FactoMineR package. Let’s explore the functions from these packages with the WDBC dataset.

##          eigenvalue variance.percent cumulative.variance.percent
## Dim.1  1.327110e+01     4.423701e+01                    44.23701
## Dim.2  5.705847e+00     1.901949e+01                    63.25650
## Dim.3  2.818852e+00     9.396173e+00                    72.65267
## Dim.4  1.975260e+00     6.584200e+00                    79.23687
## Dim.5  1.655397e+00     5.517989e+00                    84.75486
## Dim.6  1.205956e+00     4.019852e+00                    88.77471
## Dim.7  6.715595e-01     2.238532e+00                    91.01324
## Dim.8  4.757283e-01     1.585761e+00                    92.59900
## Dim.9  4.175436e-01     1.391812e+00                    93.99082
## Dim.10 3.512424e-01     1.170808e+00                    95.16162
## Dim.11 2.946575e-01     9.821917e-01                    96.14382
## Dim.12 2.618867e-01     8.729558e-01                    97.01677
## Dim.13 2.413406e-01     8.044687e-01                    97.82124
## Dim.14 1.553811e-01     5.179370e-01                    98.33918
## Dim.15 9.422671e-02     3.140890e-01                    98.65327
## Dim.16 7.852423e-02     2.617474e-01                    98.91501
## Dim.17 5.937738e-02     1.979246e-01                    99.11294
## Dim.18 5.280787e-02     1.760262e-01                    99.28897
## Dim.19 4.953305e-02     1.651102e-01                    99.45408
## Dim.20 3.117460e-02     1.039153e-01                    99.55799
## Dim.21 2.988242e-02     9.960808e-02                    99.65760
## Dim.22 2.737953e-02     9.126509e-02                    99.74886
## Dim.23 2.442628e-02     8.142094e-02                    99.83028
## Dim.24 1.806954e-02     6.023181e-02                    99.89052
## Dim.25 1.551898e-02     5.172994e-02                    99.94225
## Dim.26 7.973523e-03     2.657841e-02                    99.96882
## Dim.27 6.881183e-03     2.293728e-02                    99.99176
## Dim.28 1.594249e-03     5.314165e-03                    99.99708
## Dim.29 7.440627e-04     2.480209e-03                    99.99956
## Dim.30 1.330431e-04     4.434769e-04                   100.00000

## Principal Component Analysis Results for individuals
##  ===================================================
##   Name       Description                       
## 1 "$coord"   "Coordinates for the individuals" 
## 2 "$cos2"    "Cos2 for the individuals"        
## 3 "$contrib" "contributions of the individuals"